Capstone Project: Create a Customer Segmentation Report for Arvato Financial Services

In this project, you will analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. You'll use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, you'll apply what you've learned on a third dataset with demographics information for targets of a marketing campaign for the company, and use a model to predict which individuals are most likely to convert into customers of the company. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.

If you completed the first term of this program, you will be familiar with the first part of this project from the unsupervised learning project. The versions of those two datasets used in this project include many more features and have not been pre-cleaned. You are also free to choose whatever approach you'd like to analyze the data, rather than following pre-determined steps. In your work on this project, make sure that you carefully document your steps and decisions, since your main deliverable for this project will be a blog post reporting your findings.

Part 0: Get to Know the Data

There are four data files associated with this project:

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. Use the information from the first two files to figure out how customers ("CUSTOMERS") are similar to or differ from the general population at large ("AZDIAS"), then use your analysis to make predictions on the other two files ("MAILOUT"), predicting which recipients are most likely to become a customer for the mail-order company.

The "CUSTOMERS" file contains three extra columns ('CUSTOMER_GROUP', 'ONLINE_PURCHASE', and 'PRODUCT_GROUP'), which provide broad information about the customers depicted in the file. The original "MAILOUT" file included one additional column, "RESPONSE", which indicated whether or not each recipient became a customer of the company. For the "TRAIN" subset, this column has been retained, but in the "TEST" subset it has been removed; it is against that withheld column that your final predictions will be assessed in the Kaggle competition.

Otherwise, all of the remaining columns are the same between the three data files. For more information about the columns depicted in the files, you can refer to two Excel spreadsheets provided in the workspace. One of them is a top-level list of attributes and descriptions, organized by informational category. The other is a detailed mapping of data values for each feature in alphabetical order.

In the cell below, we've provided some initial code to load in the first two datasets. Note that all of the .csv data files in this project are semicolon (;) delimited, so an additional argument in the read_csv() call has been included to read in the data properly. Also, considering the size of the datasets, it may take some time for them to load completely.
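The load can be sketched as follows. To keep the snippet self-contained, a tiny in-memory sample stands in for the real files; the commented-out call shows the same shape on a hypothetical file name.

```python
import io
import pandas as pd

# The project's .csv files are semicolon-delimited, so read_csv needs sep=';'.
# A tiny in-memory sample stands in for the real (much larger) files here.
sample = "LNR;AGER_TYP;ALTERSKATEGORIE_GROB\n910215;-1;2\n910220;3;1\n"
azdias_sample = pd.read_csv(io.StringIO(sample), sep=';')

# For the real data the call looks the same, e.g. (hypothetical file name):
# azdias = pd.read_csv('azdias.csv', sep=';')
```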

You'll notice when the data is loaded in that a warning message will immediately pop up. Before you really start digging into the modeling and analysis, you're going to need to perform some cleaning. Take some time to browse the structure of the data and look over the informational spreadsheets to understand the data values. Make some decisions on which features to keep, which features to drop, and if any revisions need to be made on data formats. It'll be a good idea to create a function with pre-processing steps, since you'll need to clean all of the datasets before you work with them.

Own Remarks on the Data-Exploration part

⇒ the object dtype seems to be the reason for the warning (more on this later)

Analyze LNR

⇒ LNR is unique

Use LNR as index
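A minimal sketch of that step on a toy frame (the real uniqueness check runs on azdias and customers); the sample LNR values are hypothetical:

```python
import pandas as pd

# LNR is a unique person ID, so it can safely serve as the index.
df = pd.DataFrame({'LNR': [910215, 910220, 910225], 'AGER_TYP': [-1, 3, 2]})

if df['LNR'].is_unique:          # verified for the real dataframes as well
    df = df.set_index('LNR')
```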

Transform the DIAS META EXCEL (DIAS Attributes - Values 2017.xlsx) to Dictionary

Dictionary Structure
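One way this transformation might look, sketched on a small stand-in frame that mimics the layout of the values sheet (attribute names appear only on the first row of each block because of merged Excel cells). The real sheet would be loaded with pd.read_excel('DIAS Attributes - Values 2017.xlsx'); the nested structure {attribute: {value: meaning}} is an assumption about the intended dictionary.

```python
import pandas as pd

# Stand-in for the DIAS values sheet: attribute name only on the first row
# of each block, as produced by merged Excel cells.
raw = pd.DataFrame({
    'Attribute': ['OST_WEST_KZ', None, 'ANREDE_KZ', None],
    'Value':     ['O', 'W', 1, 2],
    'Meaning':   ['East (GDR)', 'West (FRG)', 'male', 'female'],
})
raw['Attribute'] = raw['Attribute'].ffill()  # fill the merged-cell gaps

# Build {attribute: {value: meaning}}.
value_dict = {
    attr: dict(zip(grp['Value'], grp['Meaning']))
    for attr, grp in raw.groupby('Attribute')
}
```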

Check attribute containment among the dataframes and between the dataframes and the DIAS documentation

Overview of column counts

Check columns azdias vs. customers

⇒ All other column names are the same in customers and azdias, so the three additional columns in customers can be removed.

Remove unneeded columns in customers
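The column comparison and removal can be sketched with plain set arithmetic, on toy frames with only a couple of columns:

```python
import pandas as pd

# Toy frames: azdias columns plus the three customer-only columns.
azdias = pd.DataFrame(columns=['LNR', 'AGER_TYP'])
customers = pd.DataFrame(columns=['LNR', 'AGER_TYP',
                                  'CUSTOMER_GROUP', 'ONLINE_PURCHASE',
                                  'PRODUCT_GROUP'])

# Columns present in customers but not in azdias.
extra = set(customers.columns) - set(azdias.columns)
customers = customers.drop(columns=list(extra))
```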

Check columns mailout_train vs. mailout_test

⇒ Apart from the RESPONSE column, the columns in mailout_train and mailout_test are identical

Check columns mailout_test vs. azdias

⇒ All other column names are the same in all the Dataframes

Check DIAS documentation vs. customers

42 attributes in the DIAS Meta-Values are not contained in customers (and, as a consequence, not in the others either).

That means customers and the documentation have only 314 - 42 = 272 common attributes.

Suspects:

Remove _RZ-suffix from the Attribute name in the DIAS documentation

⇒ we have 9 unmapped columns left now
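The suffix removal might be sketched like this; the sample attribute names are hypothetical:

```python
# Strip the '_RZ' suffix from attribute names in the documentation so they
# match the column names used in the dataframes.
doc_attrs = ['KBA13_ANZAHL_PKW', 'FINANZ_MINIMALIST_RZ', 'SEMIO_SOZ_RZ']

cleaned = [a[:-3] if a.endswith('_RZ') else a for a in doc_attrs]
```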

Manual review showed we can explicitly rename some of them

⇒ we only have 5 unmapped attributes left (and that is ok for now). So the documentation and customers have 309 common attributes.

Check attributes not in DIAS documentation

⇒ we have 56 undocumented attributes

Check key-types of the items in documentation

Result:

Are there non numeric Columns in the provided DataFrames?

⇒ We have the basic dtypes int, float and object in customers/azdias/mailout_...

Investigate the Object-type columns

⇒ The Columns 'CAMEO_DEU_2015', 'CAMEO_DEUG_2015', 'CAMEO_INTL_2015', 'D19_LETZTER_KAUF_BRANCHE', 'EINGEFUEGT_AM' and 'OST_WEST_KZ' have dtype object in all provided DataFrames
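Finding the object-typed columns is a one-liner with select_dtypes, sketched on a toy frame:

```python
import pandas as pd

# Toy frame with two object-typed columns and one numeric column.
df = pd.DataFrame({
    'CAMEO_DEU_2015': ['8A', '4C'],
    'OST_WEST_KZ': ['W', 'O'],
    'ANZ_PERSONEN': [1, 2],
})

# Columns whose dtype is object (typically strings).
object_cols = df.select_dtypes(include='object').columns.tolist()
```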

⇒ The column 'EINGEFUEGT_AM' can be deleted

⇒ The column 'MIN_GEBAEUDEJAHR' can be deleted

Explore the Values of the other Object-columns

After looking up the results in the Excel documentation we can sum things up.

Summary

Value Ranges - do the values in the Dataframes conform to the DIAS Documentation EXCEL ?

Remark: If the value ranges conform to the Excel documentation, then we can conclude that the renaming of the column names in the DIAS documentation was ok.

Clean the columns

We will apply the findings from above and clean some columns.

Check some concrete columns

CAMEO_DEU_2015, CAMEO_INTL_2015, CAMEO_DEUG_2015

CAMEO_DEU_2015

Lookup the feature in the Documentation

CAMEO_INTL_2015

⇒ CAMEO_DEU_2015, CAMEO_DEUG_2015 are highly correlated (correlation=0.9937104221123124)

⇒ CAMEO_INTL_2015, CAMEO_DEUG_2015 are also highly correlated (correlation=0.9555875475962049)

⇒ CAMEO_DEU_2015, CAMEO_INTL_2015 are also highly correlated (correlation=0.961882020658234)

⇒ CAMEO_DEU_2015 can be removed as it is highly correlated with CAMEO_DEUG_2015; otherwise one-hot encoding would be necessary.
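The correlation check requires encoding the object-typed CAMEO columns first. A sketch on toy data, using pandas category codes as a stand-in for whatever encoding was actually used:

```python
import pandas as pd

# Toy data mimicking the CAMEO columns: DEUG is essentially the first digit
# of DEU, so the two should correlate strongly.
df = pd.DataFrame({
    'CAMEO_DEU_2015': ['1A', '1B', '2A', '3C', '3C'],
    'CAMEO_DEUG_2015': [1, 1, 2, 3, 3],
})

# Label-encode the string column, then compute Pearson correlation.
encoded = df['CAMEO_DEU_2015'].astype('category').cat.codes
corr = encoded.corr(df['CAMEO_DEUG_2015'])
```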

Remove the auxiliary columns from the dataframe

D19_LETZTER_KAUF_BRANCHE

⇒ D19_UNBEKANNT (meaning 'D19_UNKNOWN') needs to be treated as an unknown value and therefore replaced by NaN.
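A minimal sketch of that replacement (sample values other than D19_UNBEKANNT are illustrative):

```python
import numpy as np
import pandas as pd

# 'D19_UNBEKANNT' encodes "unknown", so map it to NaN like the other
# unknown markers.
s = pd.Series(['D19_BANKEN_GROSS', 'D19_UNBEKANNT', 'D19_VERSICHERUNGEN'])
s = s.replace('D19_UNBEKANNT', np.nan)
```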

PRAEGENDE_JUGENDJAHRE

⇒ The correlation to GEBURTSJAHR is high (correlation=0.9427983958025491)

GEBURTSJAHR >= 1975:

⇒ All in all the value range of this feature is too narrow. Delete the attribute!

EINGEZOGENAM_HH_JAHR

Meaning:

⇒ Delete EINGEZOGENAM_HH_JAHR

LP_LEBENSPHASE_FEIN

⇒ Attribute can be removed

LP_FAMILIE_FEIN

⇒ Attribute can be removed

LP_FAMILIE_GROB

Let's check if LP_FAMILIE_GROB is ok

Suspicious! The feature 'LP_FAMILIE_GROB' uses only the values 1 to 5, which would mean we have only singles, single couples, or single parents in our data. At the same time, the feature 'LP_FAMILIE_FEIN' uses the whole value range as described in the documentation. And we have observed the high correlation of the features.

⇒ All in all it looks like the documentation of 'LP_FAMILIE_GROB' is wrong.

Let's check a bit more to make sure:

Now we can be pretty sure that the documentation is wrong regarding feature 'LP_FAMILIE_GROB'. The documentation must be updated:

    1: 'single',
    2: 'couple',
    3: 'single parent',
    4: 'family',
    5: 'multiperson household',

LP_STATUS_FEIN

⇒ Attribute can be removed

LP_STATUS_GROB

Check values in documentation

Check the used values

And now LP_STATUS_FEIN for AZDIAS

Suspicious! The feature 'LP_STATUS_GROB' uses only the values 1 to 5, which would mean we have only low-income or average earners in our data. At the same time, the feature 'LP_STATUS_FEIN' uses the whole value range as described in the documentation. And we have observed the high correlation of the features.

⇒ All in all it looks like the documentation of 'LP_STATUS_GROB' is wrong.

Let's check a bit more to make sure:

Now we can be pretty sure that the documentation is wrong regarding feature 'LP_STATUS_GROB'. The documentation must be updated:

    1: 'low-income earners',
    2: 'average earners',
    3: 'independants',
    4: 'houseowners',
    5: 'top earners',

ALTERSKATEGORIE_FEIN

⇒ use it to reconstruct GEBURTSJAHR and delete it afterwards

ALTERSKATEGORIE_GROB

⇒ remove ALTERSKATEGORIE_GROB

KBA13_HERST_BMW_BENZ vs. KBA13_BMW, KBA13_MERCEDES

⇒ remove KBA13_BMW, KBA13_MERCEDES and keep KBA13_HERST_BMW_BENZ

KBA13_HERST_AUDI_VW vs. KBA13_AUDI, KBA13_VW

⇒ remove KBA13_AUDI, KBA13_VW and keep KBA13_HERST_AUDI_VW

KBA13_HERST_FORD_OPEL vs. KBA13_OPEL, KBA13_FORD

⇒ remove KBA13_OPEL, KBA13_FORD and keep KBA13_HERST_FORD_OPEL

KBA13_HERST_EUROPA vs. KBA13_FIAT, KBA13_PEUGEOT, KBA13_RENAULT

⇒ remove KBA13_FIAT, KBA13_PEUGEOT, KBA13_RENAULT and keep KBA13_HERST_EUROPA

KBA13_HERST_ASIEN vs. KBA13_MAZDA, KBA13_NISSAN, KBA13_TOYOTA

⇒ remove KBA13_MAZDA, KBA13_NISSAN, KBA13_TOYOTA and keep KBA13_HERST_ASIEN for the sake of consistency

KBA13_ALTERHALTER_30 vs. KBA13_HALTER_20, KBA13_HALTER_25, KBA13_HALTER_30

⇒ remove KBA13_HALTER_20, KBA13_HALTER_25, KBA13_HALTER_30 and keep KBA13_ALTERHALTER_30

KBA13_ALTERHALTER_45 vs. KBA13_HALTER_35, KBA13_HALTER_40, KBA13_HALTER_45

⇒ remove KBA13_HALTER_35, KBA13_HALTER_40, KBA13_HALTER_45 and keep KBA13_ALTERHALTER_45

KBA13_ALTERHALTER_60 vs. KBA13_HALTER_50, KBA13_HALTER_55, KBA13_HALTER_60

⇒ remove KBA13_HALTER_50, KBA13_HALTER_55, KBA13_HALTER_60 and keep KBA13_ALTERHALTER_60

KBA13_ALTERHALTER_61 vs. KBA13_HALTER_65, KBA13_HALTER_66

⇒ remove KBA13_HALTER_65, KBA13_HALTER_66 and keep KBA13_ALTERHALTER_61

KBA05_CCM... vs KBA05_KW...

⇒ we could consider removing the KBA05_CCM... attributes, but let's keep them for now...

KBA13_CCM... vs. KBA13_KW... vs. KBA13_KMH...

KBA13_CCM... vs KBA13_KW...

KBA13_KMH... vs KBA13_KW...

Analogous conclusion as above

Analyze the Aggregation

Check columns with numeric datatypes (non categorical) from Documentation

As we can see in the documentation, most columns are already label encoded, i.e. categorical. We are interested in finding pure numerical columns. First we start with the ones documented as numerical.

Look in CUSTOMERS

Look in AZDIAS

Find and check the undocumented numeric attributes

Now we want to find the undocumented attributes. Therefore we examine the histograms.

Look in CUSTOMERS

Look in AZDIAS

Findings

The prefix 'ANZ_' is an abbreviation for 'Anzahl', meaning 'number (of ...)'. The undocumented attributes identified as numeric attributes are:

Binary Attributes

Outliers

To detect and fix outliers we apply Tukey's fences method, based on the interquartile range...

Tukey
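A sketch of Tukey's fences on a small toy series; values outside [Q1 - 1.5·IQR, Q3 + 1.5·IQR] count as outliers:

```python
import pandas as pd

# Toy series with one obvious outlier (100).
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 100])

# Tukey's fences: 1.5 * IQR beyond the first and third quartiles.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
```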

Features with low variance (or equivalent low stddev)

Removing features with low standard deviation is not strictly necessary, because PCA takes care of such features. The reason for this step is that we get exceptions when applying an IterativeImputer with the min_value and max_value parameters, because removing the outliers makes some features constant.
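The low-variance filter might look like this; the threshold is a hypothetical choice, and the constant column mimics the situation that breaks the imputer setup described above:

```python
import pandas as pd

# Toy frame: ALTER_KIND becomes constant after outlier removal.
df = pd.DataFrame({
    'ALTER_KIND': [3, 3, 3, 3],
    'ANZ_PERSONEN': [1, 2, 4, 3],
})

# Drop columns whose standard deviation falls below a small threshold
# (hypothetical value; effectively removes constant columns).
threshold = 1e-6
low_var = df.columns[df.std() < threshold]
df = df.drop(columns=low_var)
```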

Experiment try it out - remove outliers

We build some code to remove outliers and execute it to see the influence on the standard deviation of the features

⇒ after removing outliers the feature ALTER_KIND has constant values, and this can be a problem when applying the IterativeImputer.

Missing values

A correlation of approx. 0.9 justifies sorting the dataframe by one attribute to make visualization easier.
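The sorting step can be sketched as follows; the actual missing-value visualization would then be something like seaborn.heatmap(df_sorted.isna()), which is omitted here so the snippet stays self-contained:

```python
import numpy as np
import pandas as pd

# Two synthetic, highly correlated attributes (stand-ins for the real pair).
rng = np.random.default_rng(0)
a = rng.integers(1, 6, size=20)
df = pd.DataFrame({'A': a, 'B': a + rng.integers(0, 2, size=20)})

# Sorting by one attribute groups similar rows together, which makes a
# missing-value heatmap much easier to read.
df_sorted = df.sort_values('A')
```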

Part 1: Customer Segmentation Report

The main bulk of your analysis will come in this part of the project. Here, you should use unsupervised learning techniques to describe the relationship between the demographics of the company's existing customers and the general population of Germany. By the end of this part, you should be able to describe parts of the general population that are more likely to be part of the mail-order company's main customer base, and which parts of the general population are less so.

Part 2: Supervised Learning Model

Now that you've found which parts of the population are more likely to be customers of the mail-order company, it's time to build a prediction model. Each of the rows in the "MAILOUT" data files represents an individual that was targeted for a mailout campaign. Ideally, we should be able to use the demographic information from each individual to decide whether or not it will be worth it to include that person in the campaign.

The "MAILOUT" data has been split into two approximately equal parts, each with almost 43 000 data rows. In this part, you can verify your model with the "TRAIN" partition, which includes a column, "RESPONSE", that states whether or not a person became a customer of the company following the campaign. In the next part, you'll need to create predictions on the "TEST" partition, where the "RESPONSE" column has been withheld.
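A validation sketch under these constraints, using cross-validated ROC AUC on synthetic stand-in data. The gradient-boosting classifier here is an illustration, not necessarily the model used in the project; the real features and labels come from the cleaned "TRAIN" partition.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for the MAILOUT train data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + rng.normal(scale=2.0, size=300) > 2.0).astype(int)

# With a strong class imbalance in RESPONSE, ROC AUC is the appropriate
# validation metric (accuracy would be misleading).
model = GradientBoostingClassifier(random_state=0)
scores = cross_val_score(model, X, y, cv=3, scoring='roc_auc')
```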

Part 3: Kaggle Competition

Now that you've created a model to predict which individuals are most likely to respond to a mailout campaign, it's time to test that model in competition through Kaggle. If you click on the link here, you'll be taken to the competition page where, if you have a Kaggle account, you can enter. If you're one of the top performers, you may have the chance to be contacted by a hiring manager from Arvato or Bertelsmann for an interview!

Your entry to the competition should be a CSV file with two columns. The first column should be a copy of "LNR", which acts as an ID number for each individual in the "TEST" partition. The second column, "RESPONSE", should be some measure of how likely each individual became a customer – this might not be a straightforward probability. As you should have found in Part 2, there is a large output class imbalance, where most individuals did not respond to the mailout. Thus, predicting individual classes and using accuracy does not seem to be an appropriate performance evaluation method. Instead, the competition will be using AUC to evaluate performance. The exact values of the "RESPONSE" column do not matter as much: only that the higher values try to capture as many of the actual customers as possible, early in the ROC curve sweep.
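Writing the submission file might look like this, with toy LNR values and scores; since AUC only depends on the ranking of the RESPONSE scores, raw predict_proba outputs can be submitted directly:

```python
import io
import pandas as pd

# Two-column submission: LNR plus a predicted likelihood of responding.
submission = pd.DataFrame({
    'LNR': [1754, 1770, 1465],
    'RESPONSE': [0.12, 0.87, 0.05],
})

# Written to an in-memory buffer here; the real call would target a file,
# e.g. submission.to_csv('submission.csv', index=False).
buf = io.StringIO()
submission.to_csv(buf, index=False)
```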